This tutorial borrows heavily from Judith Degen’s Using ggplot2 to visualize data tutorial and the Introduction to R Graphics with ggplot2 from the Data Science Services at Harvard’s IQSS.
ggplot2 is a language for creating graphics in R. Its author, Hadley Wickham, based his language on “The Grammar of Graphics” (Wilkinson, 2005).
You should have R installed. If you do not have R installed, then:
Download the tutorial materials: ggplot2.zip
Open ggplot2.R in RStudio
Install the ggplot2 package by typing install.packages("ggplot2") into your RStudio console. Then, load the ggplot2 package:
library(ggplot2)
Why ggplot2?
There are some clear advantages of ggplot2 over the base graphics system included with R:
theme system for polishing plot appearance
Still, the IQSS tutorial points to some clear limitations:
3-dimensional graphics (see the rgl package instead)
graph-theory type graphs (see the igraph package instead)
interactive graphics (see the ggvis package instead)
Graphics are data visualizations: mappings from data to aesthetic attributes of geometric objects. Building blocks of a graphic include:
The languageR package contains lots of handy information, including the lexdec dataset: lexical decision latencies elicited from 21 subjects for 79 English nouns, with information tied to the participants and nouns.
Load the languageR package and the lexdec dataset:
library(languageR)
data(lexdec)
head(lexdec)
## Subject RT Trial Sex NativeLanguage Correct PrevType PrevCorrect
## 1 A1 6.340359 23 F English correct word correct
## 2 A1 6.308098 27 F English correct nonword correct
## 3 A1 6.349139 29 F English correct nonword correct
## 4 A1 6.186209 30 F English correct word correct
## 5 A1 6.025866 32 F English correct nonword correct
## 6 A1 6.180017 33 F English correct word correct
## Word Frequency FamilySize SynsetCount Length Class FreqSingular
## 1 owl 4.859812 1.3862944 0.6931472 3 animal 54
## 2 mole 4.605170 1.0986123 1.9459101 4 animal 69
## 3 cherry 4.997212 0.6931472 1.6094379 6 plant 83
## 4 pear 4.727388 0.0000000 1.0986123 4 plant 44
## 5 dog 7.667626 3.1354942 2.0794415 3 animal 1233
## 6 blackberry 4.060443 0.6931472 1.3862944 10 plant 26
## FreqPlural DerivEntropy Complex rInfl meanRT SubjFreq meanSize
## 1 74 0.7912 simplex -0.3101549 6.3582 3.12 3.4758
## 2 30 0.6968 simplex 0.8145080 6.4150 2.40 2.9999
## 3 49 0.4754 simplex 0.5187938 6.3426 3.88 1.6278
## 4 68 0.0000 simplex -0.4274440 6.3353 4.52 1.9908
## 5 828 1.2129 simplex 0.3977961 6.2956 6.04 4.6429
## 6 31 0.3492 complex -0.1698990 6.3959 3.28 1.5831
## meanWeight BNCw BNCc BNCd BNCcRatio BNCdRatio
## 1 3.1806 12.057065 0.000000 6.175602 0.000000 0.512198
## 2 2.6112 5.738806 4.062251 2.850278 0.707856 0.496667
## 3 1.2081 5.716520 3.249801 12.588727 0.568493 2.202166
## 4 1.6114 2.050370 1.462410 7.363218 0.713242 3.591166
## 5 4.5167 74.838494 50.859385 241.561040 0.679589 3.227765
## 6 1.1365 1.270338 0.162490 1.187616 0.127911 0.934882
Pro-tip: begin by setting the background to white instead of gray.
theme_set(theme_bw())
You’ll also want to make sure you’ve set your working directory to the folder these files are in:
setwd("~/git/tutorials/ggplot2/")
Suppose you want to visualize the data in lexdec. We can start with a histogram of the response time distribution. We start by calling the ggplot function, which looks for a dataframe argument and an aesthetics argument. In the aes argument, we specify the x-axis.
ggplot(lexdec, aes(x=RT))
Now, we want to add a geom_histogram layer to our base plot:
ggplot(lexdec, aes(x=RT)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
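Because a ggplot is an ordinary R object, the base plot and its layers can also be built up step by step. The sketch below stores the base plot in a variable and adds the histogram layer afterwards. The variable names p and p_hist are illustrative choices, not part of the tutorial materials.

```r
library(ggplot2)
library(languageR)
data(lexdec)

# Base plot: data and aesthetics only, no layers yet
p <- ggplot(lexdec, aes(x = RT))
length(p$layers)       # number of layers before adding a geom

# Adding a layer returns a new plot object; p itself is unchanged
p_hist <- p + geom_histogram()
length(p_hist$layers)  # number of layers after adding geom_histogram

# Printing the object draws the plot
p_hist
```

Keeping the base plot in a variable makes it easy to try several geoms on the same data and aesthetics.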
It’s always helpful to add informative axis labels:
ggplot(lexdec, aes(x=RT)) +
geom_histogram() +
xlab("Log-transformed lexical decision times") +
ylab("Number of observations")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Each of the geom layers has attributes of its own that you can specify. Here, we specify the binwidth of geom_histogram:
ggplot(lexdec, aes(x=RT)) +
geom_histogram(binwidth=0.01) +
xlab("Log-transformed lexical decision times") +
ylab("Number of observations")
To save a plot you’ve created, use ggsave. Be sure to fine-tune the width and height arguments to suit your needs.
ggsave("plots/rt_histogram.png",width=5,height=4)
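By default, ggsave writes the last plot that was displayed, and depending on your ggplot2 version it may stop with an error if the target folder does not exist yet. A minimal sketch of a more defensive version follows. It uses tempdir() as a stand-in for your working directory, and the names p, out_dir and out_file are illustrative.

```r
library(ggplot2)
library(languageR)
data(lexdec)

p <- ggplot(lexdec, aes(x = RT)) +
  geom_histogram(binwidth = 0.01)

# tempdir() stands in for your working directory (an assumption for this sketch)
out_dir <- file.path(tempdir(), "plots")
dir.create(out_dir, showWarnings = FALSE, recursive = TRUE)

# Pass the plot explicitly rather than relying on the last plot displayed
out_file <- file.path(out_dir, "rt_histogram.png")
ggsave(out_file, plot = p, width = 5, height = 4)

file.exists(out_file)  # check that the file was written
```

The width and height values are interpreted in inches unless you set the units argument.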
Often, histograms won’t suffice to visualize the patterns of data we’re interested in. To see how lexical decision times pattern as a function of word frequency, we can use the same dataset to create a scatterplot.
We begin as before, by supplying a dataframe and some aesthetics to ggplot; now, we’ll want our x-axis to plot word frequency, while our y-axis plots lexical decision times.
ggplot(lexdec, aes(x=Frequency, y=RT))
To this base we add our geom_point layer and some reasonable axis labels:
ggplot(lexdec, aes(x=Frequency, y=RT)) +
geom_point() +
xlab("Log-transformed lemma frequency") +
ylab("Log-transformed response time")
We can get a better sense of what’s going on in our data by adding a smoothing layer with geom_smooth. Here, we set the method to lm (short for linear model) so that it draws the line of best fit:
ggplot(lexdec, aes(x=Frequency, y=RT)) +
geom_point() +
geom_smooth(method="lm") +
xlab("Log-transformed lemma frequency") +
ylab("Log-transformed response time")
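To see where the line of best fit comes from, it can help to fit the same regression by hand. The sketch below assumes that geom_smooth(method="lm") uses its default formula, y ~ x, which here corresponds to RT ~ Frequency. The names fit and p are illustrative.

```r
library(ggplot2)
library(languageR)
data(lexdec)

# Fit the simple linear regression of RT on Frequency
fit <- lm(RT ~ Frequency, data = lexdec)
coef(fit)  # intercept and slope of the line of best fit

# Draw the fitted line by hand from its coefficients
p <- ggplot(lexdec, aes(x = Frequency, y = RT)) +
  geom_point(alpha = 0.25) +
  geom_abline(intercept = coef(fit)[[1]], slope = coef(fit)[[2]], color = "blue")
p
```

Unlike geom_smooth, geom_abline draws only the line itself, without the shaded confidence band.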
We can customize attributes of the plot, for example the color, size, shape, or opacity (i.e., alpha) of our points:
ggplot(lexdec, aes(x=Frequency, y=RT)) +
geom_point(color="red",size=5,shape=8,alpha=0.1) +
geom_smooth(method="lm") +
xlab("Log-transformed lemma frequency") +
ylab("Log-transformed response time")
We can also adjust our x- and y-axis limits, which zooms in on the plot but removes the points outside the specified range (hence the warnings below).
ggplot(lexdec, aes(x=Frequency, y=RT)) +
geom_point(color="red",size=5,shape=8,alpha=0.25) +
geom_smooth(method="lm") +
xlim(c(2,3)) +
ylim(c(6,6.5)) +
xlab("Log-transformed lemma frequency") +
ylab("Log-transformed response time")
## Warning: Removed 1569 rows containing non-finite values (stat_smooth).
## Warning: Removed 1569 rows containing missing values (geom_point).
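Because xlim and ylim discard the data outside the range before any statistics are computed, the smoothing line above is fit only to the points that remain. If you want to zoom in without dropping data, coord_cartesian is one alternative. The sketch below first counts the observations that fall outside the window and then zooms with coord_cartesian; the names inside and p_zoom are illustrative.

```r
library(ggplot2)
library(languageR)
data(lexdec)

# Observations inside the window used above
inside <- lexdec$Frequency >= 2 & lexdec$Frequency <= 3 &
  lexdec$RT >= 6 & lexdec$RT <= 6.5
sum(!inside)  # observations outside the window

# coord_cartesian() changes only the visible region; all rows are kept,
# so the line of best fit is still computed from the full dataset
p_zoom <- ggplot(lexdec, aes(x = Frequency, y = RT)) +
  geom_point(color = "red", size = 5, shape = 8, alpha = 0.25) +
  geom_smooth(method = "lm") +
  coord_cartesian(xlim = c(2, 3), ylim = c(6, 6.5))
p_zoom
```

Which behavior you want depends on the question: xlim and ylim answer "what does the fit look like for this subset?", while coord_cartesian answers "what does the overall fit look like in this region?".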
We can include a lot more information in our scatter plots by specifying various aesthetics. For example, we can color our points by the Length of the words in question.
ggplot(lexdec, aes(x=Frequency, y=RT, color=Length)) +
geom_point() +
geom_smooth(method="lm") +
xlab("Log-transformed lemma frequency") +
ylab("Log-transformed response time") +
labs(color="Word length\nin characters")
Rather than treating Length as a continuous collection of integers, we can treat it as a factor with discrete levels. Doing so will now draw a different smoothing line for each possible word Length.
ggplot(lexdec, aes(x=Frequency, y=RT, color=as.factor(Length))) +
geom_point() +
geom_smooth(method="lm") +
xlab("Log-transformed lemma frequency") +
ylab("Log-transformed response time") +
labs(color="Word length\nin characters")
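The difference between the two plots comes down to how R stores the variable. A quick check in base R, sketched below, shows what as.factor does to Length; the name length_f is illustrative.

```r
library(languageR)
data(lexdec)

# Length is stored as a number, so ggplot2 maps it to a continuous color gradient
is.numeric(lexdec$Length)

# Converting to a factor yields one discrete level per observed word length
length_f <- as.factor(lexdec$Length)
levels(length_f)
nlevels(length_f)
```

Each factor level forms its own group in the plot, which is why geom_smooth fits one line per word length once the conversion is applied.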
We can also specify the shape aesthetic to communicate information, say the Class to which a given word belongs (animal vs. plant).
ggplot(lexdec, aes(x=Frequency, y=RT, color=as.factor(Length), shape=Class)) +
geom_point() +
geom_smooth(method="lm") +
xlab("Log-transformed lemma frequency") +
ylab("Log-transformed response time") +
labs(color="Word length\nin characters", shape="Word class")
This plot has gotten awfully busy with all this information. We can clean things up a bit by faceting the graphic by Length and Class, rather than including all of the information on a single plot.
ggplot(lexdec, aes(x=Frequency, y=RT)) +
geom_point() +
geom_smooth(method="lm") +
xlab("Log-transformed lemma frequency") +
ylab("Log-transformed response time") +
facet_grid(Class~Length)
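In the formula passed to facet_grid, the variable to the left of the tilde defines the rows of the grid and the variable to the right defines the columns. The sketch below writes the same layout with the rows and cols arguments, which some readers find easier to read. It assumes a ggplot2 version that provides vars() (3.0.0 or later), and the names p_formula and p_vars are illustrative.

```r
library(ggplot2)
library(languageR)
data(lexdec)

# Formula interface: rows ~ columns
p_formula <- ggplot(lexdec, aes(x = Frequency, y = RT)) +
  geom_point() +
  facet_grid(Class ~ Length)

# Same layout with rows and columns named explicitly
p_vars <- ggplot(lexdec, aes(x = Frequency, y = RT)) +
  geom_point() +
  facet_grid(rows = vars(Class), cols = vars(Length))

p_vars
```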
As before, we can add more information to our plots with aesthetics, for example by coloring the points by the NativeLanguage of the participants.
ggplot(lexdec, aes(x=Frequency, y=RT, color=NativeLanguage)) +
geom_point() +
geom_smooth(method="lm") +
xlab("Log-transformed lemma frequency") +
ylab("Log-transformed response time") +
facet_grid(Class~Length)
If you don’t like the default colors, feel free to choose your own.
ggplot(lexdec, aes(x=Frequency, y=RT, color=NativeLanguage)) +
geom_point(alpha=0.25) +
geom_smooth(method="lm") +
scale_color_manual(values=c("blue","red")) +
xlab("Log-transformed lemma frequency") +
ylab("Log-transformed response time") +
facet_grid(Class~Length)
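An unnamed vector of colors is matched to the levels of NativeLanguage by position. If you would rather make the pairing explicit, scale_color_manual also accepts a named vector. The sketch below assumes the two levels are English and Other, as in the summaries printed in this tutorial; the name lang_colors is illustrative.

```r
library(ggplot2)
library(languageR)
data(lexdec)

# Named vector: each level of NativeLanguage is tied to a color explicitly
lang_colors <- c(English = "blue", Other = "red")

p <- ggplot(lexdec, aes(x = Frequency, y = RT, color = NativeLanguage)) +
  geom_point(alpha = 0.25) +
  geom_smooth(method = "lm") +
  scale_color_manual(values = lang_colors)
p
```

With named values, the colors stay attached to the right groups even if the order of the factor levels changes.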
Scatterplots are useful when it comes to visualizing lots of data, but sometimes you want to focus in on summary statistics like mean values. In that case, you’re more likely to want a barplot to visualize your results.
It’s important to remember that the point estimate of a mean is useless information without some sense of the variation in the data that led to that point estimate. We can communicate that information with error bars representing confidence intervals.
The following helper file will load the bootsSummary function, which calculates these confidence intervals for you.
source("helpers.R")
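The contents of helpers.R are not shown here, so the following is only a rough sketch of how a bootstrapped confidence interval for a mean can be computed in base R. The function boot_ci_mean, its arguments, and the simulated data are all hypothetical; this is not the bootsSummary implementation, which may differ in its details.

```r
# Hypothetical helper, NOT the bootsSummary() function from helpers.R:
# percentile bootstrap confidence interval for the mean of x
boot_ci_mean <- function(x, n_boot = 1000, conf = 0.95) {
  # Resample x with replacement many times and record each resample's mean
  boot_means <- replicate(n_boot, mean(sample(x, replace = TRUE)))
  # Take the central `conf` portion of the bootstrap distribution
  alpha <- 1 - conf
  ci <- quantile(boot_means, probs = c(alpha / 2, 1 - alpha / 2), names = FALSE)
  c(mean = mean(x), ci_low = ci[1], ci_high = ci[2])
}

# Simulated data standing in for log-transformed response times
set.seed(1)
x <- rnorm(200, mean = 6.4, sd = 0.2)
res <- boot_ci_mean(x)
res
```

The idea is that the spread of the resampled means approximates the uncertainty in the observed mean without assuming a particular distribution for the data.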
Now, suppose we want to visualize the average response time for each possible word length. First we’ll use bootsSummary to calculate those values for us.
d_s = bootsSummary(data=lexdec, measurevar="RT", groupvars=c("Length"))
## Loading required package: plyr
d_s
## Length N RT bootsci_high bootsci_low
## 1 3 168 6.304878 6.329961 6.279402
## 2 4 210 6.380140 6.412060 6.348904
## 3 5 399 6.353711 6.376235 6.331764
## 4 6 315 6.370368 6.397010 6.344676
## 5 7 189 6.436170 6.473097 6.400388
## 6 8 210 6.460569 6.497751 6.424189
## 7 9 105 6.434549 6.486505 6.383188
## 8 10 63 6.400557 6.459428 6.344956
We’ll feed this information into a new ggplot, this time using geom_bar.
ggplot(d_s,aes(x=Length,y=RT))+
geom_bar(stat="identity")
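By default, geom_bar counts the rows in each group, so stat="identity" is needed to tell it to use the y values as given. In ggplot2 2.2.0 and later, geom_col is a shorthand for the same thing. The sketch below rebuilds the summary table from the values printed above so that it runs without helpers.R; only the Length and RT columns are copied.

```r
library(ggplot2)

# Mean RT per word length, copied from the d_s output printed above
d_s <- data.frame(
  Length = 3:10,
  RT = c(6.304878, 6.380140, 6.353711, 6.370368,
         6.436170, 6.460569, 6.434549, 6.400557)
)

# geom_col() plots the y values as bar heights, like geom_bar(stat = "identity")
p <- ggplot(d_s, aes(x = Length, y = RT)) +
  geom_col()
p
```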
To add error bars that represent our confidence intervals, we use geom_errorbar.
ggplot(d_s,aes(x=Length,y=RT))+
geom_bar(stat="identity") +
geom_errorbar(aes(ymin=bootsci_low, ymax=bootsci_high, x=Length))
We can pretty the plot up some by changing colors and attributes.
ggplot(d_s,aes(x=as.factor(Length),y=RT))+
geom_bar(stat="identity",fill="lightgray",color="darkgray") +
geom_errorbar(aes(ymin=bootsci_low, ymax=bootsci_high, x=as.factor(Length), width=0.25)) +
coord_cartesian(ylim=c(6,7)) +
xlab("Word length in characters") +
ylab("Log-transformed response time")
As before, we can add lots of information to this plot by specifying additional attributes. Suppose we want to keep track of the noun Class and NativeLanguage. We’ll need to compute new means that take these factors into account.
d_s2 = bootsSummary(data=lexdec, measurevar="RT", groupvars=c("Length","Class","NativeLanguage"))
head(d_s2)
## Length Class NativeLanguage N RT bootsci_high bootsci_low
## 1 3 animal English 96 6.265256 6.298025 6.232760
## 2 3 animal Other 72 6.357708 6.395312 6.321471
## 3 4 animal English 84 6.320975 6.366753 6.275990
## 4 4 animal Other 63 6.469473 6.525823 6.414192
## 5 4 plant English 36 6.324715 6.394033 6.262072
## 6 4 plant Other 27 6.429661 6.541089 6.334528
We can then use these values in a new barplot.
ggplot(d_s2,aes(x=as.factor(Length),y=RT,fill=NativeLanguage))+
geom_bar(stat="identity", position=position_dodge()) +
geom_errorbar(aes(ymin=bootsci_low, ymax=bootsci_high, x=as.factor(Length), width=0.25), position=position_dodge(0.9)) +
coord_cartesian(ylim=c(6,7)) +
xlab("Word length in characters") +
ylab("Log-transformed response time") +
facet_grid(Class~.)
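The value 0.9 passed to position_dodge for the error bars matches the default width of the bars, which is what keeps each error bar centered on its own bar. One way to make that link explicit is to create the position once and reuse it, as sketched below. The data frame contains only the six rows of d_s2 printed above, and the names d_s2_head and dodge are illustrative.

```r
library(ggplot2)

# First six rows of d_s2, copied from the output printed above
d_s2_head <- data.frame(
  Length = c(3, 3, 4, 4, 4, 4),
  Class = c("animal", "animal", "animal", "animal", "plant", "plant"),
  NativeLanguage = c("English", "Other", "English", "Other", "English", "Other"),
  RT = c(6.265256, 6.357708, 6.320975, 6.469473, 6.324715, 6.429661),
  bootsci_high = c(6.298025, 6.395312, 6.366753, 6.525823, 6.394033, 6.541089),
  bootsci_low = c(6.232760, 6.321471, 6.275990, 6.414192, 6.262072, 6.334528)
)

# One shared position object, so bars and error bars are dodged by the same amount
dodge <- position_dodge(width = 0.9)

p <- ggplot(d_s2_head, aes(x = as.factor(Length), y = RT, fill = NativeLanguage)) +
  geom_col(position = dodge) +
  geom_errorbar(aes(ymin = bootsci_low, ymax = bootsci_high),
                width = 0.25, position = dodge) +
  coord_cartesian(ylim = c(6, 7)) +
  facet_grid(Class ~ .)
p
```

If you change the dodge width in one place only, the error bars drift away from the centers of their bars.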
You might want to further customize the plot attributes, for example with custom colors.
ggplot(d_s2,aes(x=as.factor(Length),y=RT,fill=NativeLanguage))+
geom_bar(stat="identity", position=position_dodge()) +
geom_errorbar(aes(ymin=bootsci_low, ymax=bootsci_high, x=as.factor(Length), width=0.25), position=position_dodge(0.9)) +
coord_cartesian(ylim=c(6,7)) +
scale_fill_manual(values=c("blue","red")) +
xlab("Word length in characters") +
ylab("Log-transformed response time") +
facet_grid(Class~.)
Let’s simplify things a bit and focus in on the effect of NativeLanguage on RT.
d_s3 = bootsSummary(data=lexdec, measurevar="RT", groupvars=c("NativeLanguage"))
head(d_s3)
## NativeLanguage N RT bootsci_high bootsci_low
## 1 English 948 6.318309 6.331171 6.305776
## 2 Other 711 6.474130 6.493372 6.454910
We can then visualize these values.
ggplot(d_s3,aes(x=NativeLanguage,y=RT))+
geom_bar(stat="identity", position=position_dodge(), fill="lightgray",color="darkgray") +
geom_errorbar(aes(ymin=bootsci_low, ymax=bootsci_high, x=NativeLanguage, width=0.25), position=position_dodge(0.9)) +
coord_cartesian(ylim=c(5,8)) +
xlab("Native Language") +
ylab("Log-transformed response time")
But suppose in addition to the means and the confidence intervals, we want to get a sense of the actual observations that led to these values. We can plot these observations by adding a geom_jitter layer using the original RT values from the lexdec dataframe.
ggplot(d_s3,aes(x=NativeLanguage,y=RT))+
geom_bar(stat="identity", position=position_dodge(), fill="lightgray",color="darkgray") +
geom_jitter(data=lexdec,aes(y=RT),alpha=.25,color="red") +
geom_errorbar(aes(ymin=bootsci_low, ymax=bootsci_high, x=NativeLanguage, width=0.25), position=position_dodge(0.9)) +
coord_cartesian(ylim=c(5,8)) +
xlab("Native Language") +
ylab("Log-transformed response time")
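By default, geom_jitter adds a small amount of random noise in both the horizontal and vertical directions, and the result changes every time the plot is drawn. If you want the points to keep their exact RT values and the plot to be reproducible, one option is to spell out the jitter yourself, as in the sketch below. The width of 0.2 and the seed of 42 are arbitrary choices, and the names jitter and p are illustrative.

```r
library(ggplot2)
library(languageR)
data(lexdec)

# Jitter horizontally only, with a fixed seed so the plot is reproducible
jitter <- position_jitter(width = 0.2, height = 0, seed = 42)

p <- ggplot(lexdec, aes(x = NativeLanguage, y = RT)) +
  geom_point(position = jitter, alpha = 0.25, color = "red")
p
```

Here geom_point with an explicit position does the same job as geom_jitter, but the amount and direction of the noise are under your control.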
Violin plots provide another means of visualizing the distribution of responses. To understand violin plots, we start by visualizing the density of a distribution, in this case the distribution of RTs.
ggplot(lexdec,aes(x=RT))+
geom_density()
We can add information to this plot with additional aesthetics.
ggplot(lexdec,aes(x=RT,fill=NativeLanguage))+
geom_density(alpha=0.5)
A violin plot effectively flips this information on its side:
ggplot(lexdec,aes(y=RT,x=NativeLanguage))+
geom_violin()
To add back in summary information such as the median and quartiles, we can add a geom_boxplot layer.
ggplot(lexdec,aes(y=RT,x=NativeLanguage))+
geom_violin() +
geom_boxplot(notch=TRUE,width=.25)
We could also add in information about the mean and confidence intervals; we’ve computed those values in d_s3 above.
ggplot(lexdec,aes(y=RT,x=NativeLanguage))+
geom_violin() +
geom_point(data=d_s3,aes(y=RT,x=NativeLanguage),shape=95,size=5) +
geom_errorbar(data=d_s3,aes(ymin=bootsci_low, ymax=bootsci_high, x=NativeLanguage, width=0.25))
And of course, we can customize the color information and other attributes.
ggplot(lexdec,aes(y=RT,x=NativeLanguage,fill=NativeLanguage))+
geom_violin() +
geom_point(data=d_s3,aes(y=RT,x=NativeLanguage),shape=95,size=5,color="white") +
geom_errorbar(data=d_s3,aes(ymin=bootsci_low, ymax=bootsci_high, x=NativeLanguage, width=0.25),color="white") +
scale_fill_manual(values=c("blue","red")) +
xlab("Native Language") +
ylab("Log-transformed response time") +
guides(fill="none")